What is the "Bitter Lesson"?

The Bitter Lesson is a thesis introduced by Rich Sutton in a 2019 essay of the same name. It states that, for improving AI capabilities, “general methods that leverage computation are ultimately the most effective, and by a large margin”, compared to approaches that build in human knowledge of the domain.

Historically, AI research has mostly designed systems to use a fixed amount of computing power, improving their performance by applying domain-specific human knowledge. In theory, this approach is compatible with also improving performance by scaling (increasing computing power), but in practice, the complexity added by building in human knowledge makes it harder to also leverage computation. Available computing power has kept growing exponentially, in line with Moore’s law, and past trends suggest that leveraging this growth is what increases performance in the long run.

Some fields that Sutton cites as examples of the Bitter Lesson are:

Games: In chess, Deep Blue beat world champion Garry Kasparov in 1997 by using enormous computing power (for the time) to search deeply through the tree of possible moves. Similarly, in Go, AlphaGo beat world champion Lee Sedol in 2016 using deep learning plus Monte Carlo tree search to find its moves, instead of using human-engineered Go techniques. In 2017, AlphaGo Zero beat AlphaGo after learning purely through self-play, without using human-generated Go data at all. None of these advances relied on humans coding in a deeper strategic understanding of the game.
Vision: Early computer vision methods worked with human-engineered features and convolution kernels to perform image recognition tasks, but it was later found that leveraging more compute and letting convolutional neural nets (CNNs) learn their own features yields much better performance.
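
To make the "search through the tree of possible moves" idea concrete, here is a minimal sketch of depth-limited game-tree search (negamax). It is not Deep Blue's or AlphaGo's actual method: the toy take-away game, the function names, and the node counter are all assumptions chosen for illustration. The point is only that the same general procedure gives better answers when it is allowed to spend more computation, with no game-specific strategy coded in.

```python
# Toy game (an assumption for illustration): one pile of stones, players
# alternate removing 1-3 stones, and whoever takes the last stone wins.

def negamax(stones, depth, counter):
    """Return +1 if the player to move can force a win, -1 if they are
    forced to lose, or 0 if the search ran out of depth before deciding."""
    counter[0] += 1                  # nodes visited, a rough proxy for compute
    if stones == 0:
        return -1                    # the previous player took the last stone
    if depth == 0:
        return 0                     # search budget exhausted: outcome unknown
    best = -1
    for take in (1, 2, 3):
        if take <= stones:
            # The opponent's value, negated, is the value for the current player.
            best = max(best, -negamax(stones - take, depth - 1, counter))
    return best


def search(stones, depth):
    """Run a depth-limited search and return (value, nodes_visited)."""
    counter = [0]
    value = negamax(stones, depth, counter)
    return value, counter[0]


if __name__ == "__main__":
    for depth in (2, 4, 8, 12):
        value, nodes = search(12, depth)
        print(f"depth={depth:2d} value={value:+d} nodes={nodes}")
```

With 12 stones, the shallow searches return 0 (undecided), while a sufficiently deep search proves that the position is lost for the player to move (-1), at the cost of visiting many more nodes.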

Modern AI research has come to favor general-purpose methods of search and learning, which continue to scale with increasing compute. Over the last couple of generations of transformer models, simply scaling up models has been so effective that OpenAI proposed scaling laws for language models in 2020, and DeepMind refined them in 2022. (However, whether scaling alone will lead to AGI is a different question.)
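
Scaling laws of this kind are typically stated as power laws relating a model's loss to quantities such as compute. The sketch below shows what that shape implies. The function name and the constants `a` and `b` are made up for illustration and are not taken from the OpenAI or DeepMind results.

```python
def power_law_loss(compute, a=10.0, b=0.05):
    """Hypothetical scaling curve: loss = a * compute ** (-b).

    The constants a and b are invented for illustration only.
    """
    return a * compute ** (-b)


if __name__ == "__main__":
    # Each 100x increase in compute multiplies the loss by the same factor.
    for exponent in range(18, 25, 2):
        compute = 10.0 ** exponent
        print(f"compute=1e{exponent} loss={power_law_loss(compute):.3f}")
```

Under a power law, every tenfold increase in compute multiplies the loss by the same constant factor (here 10 ** -0.05), so improvements are predictable but each one requires much more compute than the last.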

Can we get AGI by scaling up architectures similar to current ones, or are we missing key insights?

What are scaling laws?

What is compute?